Libraries
These are the libraries to load
pkg <- c('pivottabler', 'tidyverse','tibbletime','anomalize','timetk')
#install.packages(pkg)
library(pivottabler)
library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(tibbletime)
library(anomalize)
library(timetk)
library(forecast)
Read the revenue and AdSpend
We check the date type of the ‘date’ column and change the type to
DATE
The CSV data is avaialble in the github repository
setwd("/Users/gerardogandara/Documents/recast/")
## Read file as CSV
df_spend <- read.csv('https://raw.githubusercontent.com/ggandara13/myrepository/master/acme_spend.csv')
df_revenue <- read.csv('https://raw.githubusercontent.com/ggandara13/myrepository/master/acme_revenue.csv')
# Convert char format to DATE format
df_revenue$date <- as.Date(df_revenue$date, format = "%m/%d/%y")
df_spend$date <- as.Date(df_spend$date, format = "%m/%d/%y")
# Check the format and content of each dataframe
str(df_revenue)
'data.frame': 1035 obs. of 5 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ date : Date, format: "2020-01-01" "2020-01-02" "2020-01-03" ...
$ revenue_dtc : num 16000 19306 22052 23177 26169 ...
$ revenue_amazon : num 50000 60332 68914 72428 81778 ...
$ revenue_walmart: num 34000 41026 46861 49251 55609 ...
str(df_spend)
'data.frame': 10350 obs. of 4 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ date : Date, format: "2020-01-01" "2020-01-01" "2020-01-01" ...
$ channel: chr "facebook_prospecting" "facebook_retargeting" "google_branded_search" "google_nonbranded_search" ...
$ spend : num 4154 2679 474 3817 7421 ...
Check there are not NULL values
paste("Total of NULL in Spend:", sum(is.na(df_spend)))
[1] "Total of NULL in Spend: 0"
paste("\nTotal of NULL in Revenue:", sum(is.na(df_revenue)))
[1] "\nTotal of NULL in Revenue: 0"
paste("\n\nSummary\n")
[1] "\n\nSummary\n"
# you can check with summary as well and see the columns
summary(df_spend)
X date channel spend Year
Min. : 1 Min. :2020-01-01 Length:10350 Min. : 0 Min. :2020
1st Qu.: 2588 1st Qu.:2020-09-15 Class :character 1st Qu.: 1278 1st Qu.:2020
Median : 5176 Median :2021-06-01 Mode :character Median : 5268 Median :2021
Mean : 5176 Mean :2021-06-01 Mean : 7740 Mean :2021
3rd Qu.: 7763 3rd Qu.:2022-02-15 3rd Qu.:11649 3rd Qu.:2022
Max. :10350 Max. :2022-10-31 Max. :36401 Max. :2022
summary(df_revenue)
X date revenue_dtc revenue_amazon revenue_walmart
Min. : 1.0 Min. :2020-01-01 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 259.5 1st Qu.:2020-09-15 1st Qu.: 41643 1st Qu.:130216 1st Qu.: 88547
Median : 518.0 Median :2021-06-01 Median : 72051 Median :225161 Median :153109
Mean : 518.0 Mean :2021-06-01 Mean : 72741 Mean :227062 Mean :154488
3rd Qu.: 776.5 3rd Qu.:2022-02-14 3rd Qu.: 97340 3rd Qu.:303352 3rd Qu.:206279
Max. :1035.0 Max. :2022-10-31 Max. :193005 Max. :415952 Max. :282848
sum(is.na(df_spend))/nrow(df_spend)*100
[1] 0
sum(is.na(df_revenue))/nrow(df_revenue)*100
[1] 0
Since we have 1,035 rows, we can remove because that represents less
than 1% or just replace with zero
Now, let’s replace with zero since it is a time series
df_revenue[is.na(df_revenue)] <- 0
df_spend[is.na(df_spend)] <- 0
cat("Total of NULL in Spend:", sum(is.na(df_spend)))
Total of NULL in Spend: 0
cat("\nTotal of NULL in Revenue:", sum(is.na(df_revenue)))
Total of NULL in Revenue: 0
1- Which channel had the largest increase in spend in 2022 compared
to the same date range in 2021?
Using lubridate to filter the year of the date, also use the
pivottabler library to group/aggregate it
Answer: online_video
# print the first value of the dataframe
paste("The channel with highest LIFT is:", row.names(df_summary_final)[1] )
[1] "The channel with highest LIFT is: online_video"
2- In terms of total revenue, are there any anomalous days?
Group by day to see the trend and plot
df_revenue_total <- df_revenue %>%
mutate(total_revenue = ( df_revenue$revenue_dtc + df_revenue$revenue_amazon + df_revenue$revenue_walmart ))
# Group by sum using R Base aggregate()
agg_df <- aggregate(df_revenue_total$total_revenue, by=list(df_revenue_total$date), FUN=sum)
colnames(agg_df) <- c('date','revenue')
#Plot sales over 36 months
ggplot(agg_df, aes(x=date, y = revenue)) + geom_line()

paste("Visually we can see some outliers")
[1] "Visually we can see some outliers"
Now, we can use the anomalties library to identify the values
Use time_decompose() to decompose a time series prior to performing
anomaly detection with anomalize(). Typically, anomalize() is performed
on the “remainder” of the time series decomposition.
The return has three columns: “remainder_l1” (lower limit for
anomalies), “remainder_l2” (upper limit for anomalies), and “anomaly”
(Yes/No).

We can print the detail of the anomalies, for example the last 4
tail(df_anomalized %>%
filter(anomaly == 'Yes'),4)
NA
You can also use the timetk package for dynamic plot
agg_df %>% timetk::plot_anomaly_diagnostics(date,revenue, .facet_ncol = 2)
frequency = 7 observations per 1 week
trend = 92 observations per 3 months
3- In which month of the year does Acme tend to make the most
revenue?
We can compute month column before we aggregate and sort to get the
popular month
Most profitale month in a year is the month number: 3
# Calculate the month column
monthfunction<-function(dataframe, datecolumn) {
month(dataframe[,datecolumn])
}
agg_df$Month <- monthfunction(agg_df, "date")
# group by month
by_month <- aggregate(agg_df$revenue, by = list(agg_df$Month), FUN = sum)
colnames(by_month) <- c('month','revenue')
by_month <- by_month[order(by_month$revenue, decreasing = TRUE), ]
# print the first value of the dataframe
print(by_month)
paste("Most profitale month in a year is the month number:", row.names(by_month)[1] )
[1] "Most profitale month in a year is the month number: 3"
4- Does Acme’s marketing spend tend to follow a similar pattern to
revenue?
We need to merge both dataframes to have the TS of Revenue using
AdSped as regressor Convert the dataframe to TS then plot, then plot the
results
Visually we can confirm they have a similar pattern.
#select the date and total_revenue of Acme
df_only_revenue <- select(df_revenue_total, date, total_revenue)
#aggregate spend of all media
df_spend_total <- aggregate(df_spend$spend, by=list(df_spend$date), FUN=sum)
colnames(df_spend_total) <- c('date','spend')
df_rev_spend <- merge(df_only_revenue, df_spend_total, by = "date")
#Convert dataframe to time series object using the ts() function
acme_ts <- ts(data = df_rev_spend[,c(2,3)])
# Time plot of both variables
autoplot(acme_ts, facets = TRUE)

5- Based on what you see here, could you make the case that
marketing spend is causally related to revenue? Why or why not?
Visually we can see the effect has high correlation, and when we
calculate the R2 is 0.91 with p=value basically zero. So, it is
statiscally correlated.

Is the relation linear? Yes, form the plot above, the relationship is
linear.
now we jsut need to verify that the data from each of the 2 variables
(revenue, spend) follow a normal distribution.
df_rev_spend
# Shapiro-Wilk normality test for revenue
shapiro.test(df_rev_spend$total_revenue) # => p-value < 2.2e-16
Shapiro-Wilk normality test
data: df_rev_spend$total_revenue
W = 0.93373, p-value < 2.2e-16
# Shapiro-Wilk normality test for spend
shapiro.test(df_rev_spend$spend) # => p-value < 2.2e-16
Shapiro-Wilk normality test
data: df_rev_spend$spend
W = 0.95587, p-value < 2.2e-16
# revenue
ggqqplot(df_rev_spend$total_revenue, ylab = "revenue")

# spend
ggqqplot(df_rev_spend$spend, ylab = "spend")

NA
NA
Since the p-value is practically zero, we can reject the null
hypothesis that it is normal distributed both variables
so we can use another test to check asociation Kendall rank
correlation test - Kendall’s tau
This is used to estimate a rank-based measure of association. This
test may be used if the data do not necessarily come from a bivariate
normal distribution.
The correlation coefficient between revenue and spend are 0.7209243
and the p-value < 2.2e-16. Exist correlation
but we know it is not causation
res2
Kendall's rank correlation tau
data: df_rev_spend$total_revenue and df_rev_spend$spend
z = 34.731, p-value < 2.2e-16
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.7209243
We know that they are highly correlated. We can use time series to
see the magnitude of the effect of marketing using it as regressor
Visually in the previous plot of revenue, we can see it is
stationary, and there is seasonality. We can use auto.arima to calculate
the regressor factor
This is a very basic analysis, but it helps to understand in overall
how the marketing efforts are creating an effect in the revenue.
1.489071 is the regressor factor
Interpretation of sales increase: For every $1,000 (the unit of ad
spend is in thousands USD) increase in advertising spend, the unit sales
increase by 1.4891 - It is a positive effect, so ACME probably is using
RECAST and the marketing campaign is performaing well
sales_increase
xreg
1.489071
---
title: "Assignment-RECAST Modern Marketing Mix Modeling "
output:
  html_document:
    df_print: paged
---

###########################################################################
#                                                                         #
# Purpose:       Revenue vs AdSpend effect                                #
#                                                                         #
# Author:        Gerardo Gandara                                          #
# Contact:       gerardo.gandara@gmail.com                                #
#                                                                         #
# Client:        andy@getrecast.com                                       #
#                                                                         #
# Code created:  2023-10-11                                               #
# Last updated:  2023-10-11                                               #
# Source:        https://github.com/ggandara13/myrepository               #
#                                                                         #
# Comment:       https://getrecast.com/modern-media-mix-modeling/.        #
#                                                                         #
###########################################################################


# Libraries
These are the libraries to load


```{r}
pkg <- c('pivottabler', 'tidyverse','tibbletime','anomalize','timetk')
#install.packages(pkg)

library(pivottabler)
library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(tibbletime)
library(anomalize)
library(timetk)
library(forecast)

```



Read the revenue and AdSpend

We check the date type of the 'date' column and change the type to DATE

The CSV data is avaialble in the github repository


```{r}

setwd("/Users/gerardogandara/Documents/recast/")

## Read file as CSV
df_spend <- read.csv('https://raw.githubusercontent.com/ggandara13/myrepository/master/acme_spend.csv')
df_revenue <- read.csv('https://raw.githubusercontent.com/ggandara13/myrepository/master/acme_revenue.csv')

# Convert char format to DATE format
df_revenue$date <- as.Date(df_revenue$date, format =  "%m/%d/%y")
df_spend$date <- as.Date(df_spend$date, format =  "%m/%d/%y")

# Check the format and content of each dataframe
str(df_revenue)
str(df_spend)

```


Check there are not NULL values 

```{r}

paste("Total of NULL in Spend:", sum(is.na(df_spend)))

paste("\nTotal of NULL in Revenue:", sum(is.na(df_revenue)))

paste("\n\nSummary\n")
# you can check with summary as well and see the columns
summary(df_spend)
summary(df_revenue)

sum(is.na(df_spend))/nrow(df_spend)*100
sum(is.na(df_revenue))/nrow(df_revenue)*100


```
Since we have 1,035 rows, we can remove because that represents less than 1% or just replace with zero

Now, let's replace with zero since it is a time series
```{r}

df_revenue[is.na(df_revenue)] <- 0
df_spend[is.na(df_spend)] <- 0


cat("Total of NULL in Spend:", sum(is.na(df_spend)))
cat("\nTotal of NULL in Revenue:", sum(is.na(df_revenue)))


```




## 1- Which channel had the largest increase in spend in 2022 compared to the same date range in 2021?

Using lubridate to filter the year of the date, also use the pivottabler library to group/aggregate it

Answer: online_video


```{r}
yearfunction<-function(dataframe, datecolumn) {
    year(dataframe[,datecolumn])
}
df_spend$Year <- yearfunction(df_spend, "date")


df <- df_spend %>% group_by(channel) %>% 
  filter(lubridate::year(date) %in% c(2021, 2022) )
df <- as.data.frame(df)


# initialice the PIVOT
pt <- PivotTable$new()
# add the df with 2 years
pt$addData(df)
pt$addColumnDataGroups("Year")
pt$addRowDataGroups("channel")
pt$defineCalculation(calculationName="TotalSpend", summariseExpression="sum(spend)")
pt$evaluatePivot()

#covnert to dataframe
df1 <- pt$asDataFrame()
df_summary_final <- df1 %>%
  mutate(index_lift = ( (df1[,2]/ df1[,1]) - 1 ) *100  )

df_summary_final<- df_summary_final[order(df_summary_final$index_lift, decreasing = TRUE), ]


# print the first value of the dataframe
paste("The channel with highest LIFT is:", row.names(df_summary_final)[1] )


```




## 2- In terms of total revenue, are there any anomalous days?

Group by day to see the trend and plot


```{r}

df_revenue_total <- df_revenue %>%
  mutate(total_revenue = ( df_revenue$revenue_dtc + df_revenue$revenue_amazon + df_revenue$revenue_walmart  ))

# Group by sum using R Base aggregate()
agg_df <- aggregate(df_revenue_total$total_revenue, by=list(df_revenue_total$date), FUN=sum)
colnames(agg_df) <- c('date','revenue') 


#Plot sales over 36 months
ggplot(agg_df, aes(x=date, y = revenue)) + geom_line()

paste("Visually we can see some outliers")
```

Now, we can use the anomalties library to identify the values

Use time_decompose() to decompose a time series prior to performing anomaly detection with anomalize(). Typically, anomalize() is performed on the "remainder" of the time series decomposition.

The return has three columns: "remainder_l1" (lower limit for anomalies), "remainder_l2" (upper limit for anomalies), and "anomaly" (Yes/No).



```{r}
# Convert df to a tibble
df <- as_tibble(agg_df)
class(df)

df_anomalized <- df %>%
    time_decompose(revenue, method = "stl", merge = TRUE) %>%
    anomalize(remainder, method = "iqr", alpha = 0.05, max_anoms = 0.2) %>%
    time_recompose()

#We can then visualize the anomalies using the plot_anomalies() function. 
df_anomalized %>% plot_anomalies(ncol = 1, alpha_dots = 0.75)

```

We can print the detail of the anomalies, for example the last 4


```{r}
tail(df_anomalized %>%
  filter(anomaly == 'Yes'),4)
```

You can also use the timetk package for dynamic plot

```{r}
agg_df %>% timetk::plot_anomaly_diagnostics(date,revenue, .facet_ncol = 2)

```


## 3- In which month of the year does Acme tend to make the most revenue?

We can compute month column before we aggregate and sort to get the popular month

Most profitale month in a year is the month number: 3

```{r}

# Calculate the month column
monthfunction<-function(dataframe, datecolumn) {
    month(dataframe[,datecolumn])
}
agg_df$Month <- monthfunction(agg_df, "date")

# group by month

by_month <- aggregate(agg_df$revenue, by = list(agg_df$Month), FUN = sum)
colnames(by_month) <- c('month','revenue') 



by_month <- by_month[order(by_month$revenue, decreasing = TRUE), ]

# print the first value of the dataframe
print(by_month)
paste("Most profitale month in a year is the month number:", row.names(by_month)[1] )

```


## 4- Does Acme's marketing spend tend to follow a similar pattern to revenue?

We need to merge both dataframes to have the TS of Revenue using AdSped as regressor
Convert the dataframe to TS then plot, then plot the results

Visually we can confirm they have a similar pattern.

```{r}

#select the date and total_revenue of Acme
df_only_revenue <- select(df_revenue_total, date, total_revenue)

#aggregate spend of all media
df_spend_total <- aggregate(df_spend$spend, by=list(df_spend$date), FUN=sum)
colnames(df_spend_total) <- c('date','spend') 

df_rev_spend <- merge(df_only_revenue, df_spend_total, by = "date")


#Convert dataframe to time series object using the ts() function
acme_ts <- ts(data = df_rev_spend[,c(2,3)])


# Time plot of both variables
autoplot(acme_ts, facets = TRUE)

```






## 5- Based on what you see here, could you make the case that marketing spend is causally related to revenue? Why or why not?

Visually we can see the effect has high correlation, and when we calculate the R2 is 0.91 with p=value basically zero. So, it is statiscally correlated.


```{r}

library("ggpubr")

df_rev_spend

ggscatter(df_rev_spend, x = "spend", y = "total_revenue", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Total AdSpend", ylab = "Total Revenue - 3 sources ")

```
Is the relation linear? Yes, form the plot above, the relationship is linear.

now we jsut need to verify that the data from each of the 2 variables (revenue, spend) follow a normal distribution.

```{r}

df_rev_spend


# Shapiro-Wilk normality test for revenue
shapiro.test(df_rev_spend$total_revenue) # => p-value < 2.2e-16

# Shapiro-Wilk normality test for spend
shapiro.test(df_rev_spend$spend) # => p-value < 2.2e-16


# revenue 
ggqqplot(df_rev_spend$total_revenue, ylab = "revenue")
# spend
ggqqplot(df_rev_spend$spend, ylab = "spend")
```

Since the p-value is practically zero, we can reject the null hypothesis that it is normal distributed both variables

so we can use another test to check asociation
Kendall rank correlation test - Kendall’s tau 

This is used to estimate a rank-based measure of association. This test may be used if the data do not necessarily come from a bivariate normal distribution.

The correlation coefficient between revenue and spend are 0.7209243 and the p-value < 2.2e-16.
Exist correlation

but we know it is not causation

```{r}
res2 <- cor.test(df_rev_spend$total_revenue, df_rev_spend$spend,  method="kendall")
res2
```


We know that they are highly correlated. We can use time series to see the magnitude of the effect of marketing using it as regressor

Visually in the previous plot of revenue, we can see it is stationary, and there is seasonality. We can use auto.arima to calculate the regressor factor

This is a very basic analysis, but it helps to understand in overall how the marketing efforts are creating an effect in the revenue.

1.489071 is the regressor factor

#### Interpretation of sales increase: For every $1,000 (the unit of ad spend is in thousands USD) increase in advertising spend, the unit sales increase by 1.4891 - It is a positive effect, so ACME probably is using RECAST and the marketing campaign is performaing well



```{r}
# we need to convert the daily data to weekly
weekno <- as.numeric(df_rev_spend$date - df_rev_spend$date[1]) %/% 7
Week <- df_rev_spend$date[1] + 7 * weekno
df_by_week <- aggregate(cbind(total_revenue,spend)  ~ Week, df_rev_spend, sum)

#to check it looks good
df_by_week

#Convert dataframe to time series object using the ts() function
acme_ts_week <- ts(data = df_by_week[,c(2,3)])
# Time plot of both variables
autoplot(acme_ts_week, facets = TRUE)

# Fit ARIMA model
fit <- auto.arima(acme_ts_week[, "total_revenue"], xreg = acme_ts_week[, "spend"], 
                  stationary = TRUE, seasonal=TRUE, 
                  allowdrift = FALSE, allowmean = FALSE )#,lambda = Lambda)

fit

#The auto.arima function has fit a linear regression model to the advertising variable and an an ARIMA model to the time series (date) variable. ARIMA(2,0,1)

#What is the increase in sales for each unit increase in advertising?
sales_increase <- coefficients(fit)[4]
sales_increase




#Check Residuals to verify that the residuals are white noise and normal distributed - Indeed they are so the model is acceptable
checkresiduals(fit)

```




Comment: https://getrecast.com/modern-media-mix-modeling/.